Historic,  archived  document 

Do  not  assume  content  reflects  current 
scientific  knowledge,  policies,  or  practices. 




Research  Note  RM-512 

October  1991 

USDA  Forest  Service 

Rocky  Mountain  Forest  and 
Range  Experiment  Station 


The  Effect  of  Training  Block  Size  on  Unsupervised 
Classification  of  Landsat  Thematic  Mapper  Imagery 

Paul  W.  Snook1 

This study examined the effect of pixel block size on the development
and labeling of spectral signatures from an unsupervised classification
of Landsat data. Results indicated no significant difference in
classification of Thematic Mapper (TM) data based upon the size of
blocks input to the classifier.

Keywords:  Land  cover  classification,  satellite  data,  training  block  size 


Introduction 

Forest inventory data in the southern United States are supplied on a
6- to 8-year cycle by the USDA Forest Service Forest Inventory and
Analysis Units (FIA). Economic and climatological fluctuations can
cause significant shifts in land use and land cover during the FIA
survey cycle. Budget restraints, however, restrict the use of more
frequent ground-based surveys. Data for mapping, monitoring, and
assessing the condition of natural resources have commonly been
augmented by remotely sensed sources. Remotely sensed data provide
efficient coverage of vast areas and inaccessible locations, thereby
reducing the time and cost of ground sampling. Satellites provide
imagery on a national scale with a rapid repeat cycle. This imagery
can provide the temporal resolution and reduced cost needed to monitor
and predict rates of change in the land base on a short-term basis.

Computer-assisted classification of digital satellite imagery is based
upon developing distinct spectral groupings (signatures) which can be
associated with specific land features. Such classification of
satellite data can be categorized into two general strategies:
supervised and unsupervised. Supervised classification requires the
analyst to select predetermined information classes. Areas of known
land features are identified on the satellite image and pixels that
represent that area are used to develop a spectral signature (spectral
characteristics of a land feature). After spectral signatures are
developed for all land categories of interest, unknown areas are
classified into one of the identified categories of interest based on
the spectral signatures.

1 Research Biologist, Rocky Mountain Forest and Range Experiment
Station; headquarters is in Fort Collins, in cooperation with Colorado
State University.

Unsupervised classification does not require that spectral signatures
be developed by the analyst. Spectral signatures are developed from
all available pixels and separated into clusters with user-defined
mean and variance limits. The clusters developed in this manner are
identified after classification using available reference data
sources.

The purpose of either strategy is to develop spectral signatures that
represent unknown areas. Ideally, a large sample should be taken to
provide a robust set of signatures. This set of signatures must be
able to represent subtle differences within broad land cover features
over extended areas. For example, within a pine forest the canopy
closure, size class, and species composition may vary spatially. If
this feature is extracted in a supervised approach, by delineating the
stand as a unit, a signature is developed that represents the stand as
a whole. However, the stand is actually composed of varying features,
each with a different spectral response. An unsupervised approach
could produce spectral signatures for the separate features. This
finer level of detail can be extended to unknown areas better than
composite signatures developed from the supervised approach.
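
The contrast between a composite signature and separate component
signatures can be illustrated with a small numerical sketch. The band
values below are hypothetical and chosen only for illustration (they
are not from the study), as are the helper names `mean_signature` and
`nearest`; a nearest-mean rule is assumed here simply as a convenient
way to compare signatures.

```python
from math import dist

# Hypothetical mean digital numbers in three bands (illustrative only,
# not values from the study).
dense_pine = (20.0, 60.0, 40.0)
open_pine = (30.0, 80.0, 70.0)
grassland = (34.0, 90.0, 78.0)


def mean_signature(signatures):
    """Average several signatures band by band into one composite."""
    return tuple(sum(band) / len(signatures) for band in zip(*signatures))


def nearest(pixel, signatures):
    """Return the name of the signature closest to the pixel."""
    return min(signatures, key=lambda name: dist(pixel, signatures[name]))


# One signature for the whole stand, as in a delineated polygon.
composite = mean_signature([dense_pine, open_pine])

with_composite = {"pine (composite)": composite, "grassland": grassland}
with_components = {
    "dense pine": dense_pine,
    "open pine": open_pine,
    "grassland": grassland,
}

pixel = open_pine  # a pixel from the open part of the stand
print(nearest(pixel, with_composite))
print(nearest(pixel, with_components))
```

With these made-up values, the open-stand pixel lies closer to the
grassland signature than to the composite pine signature, but it is
matched correctly when the stand components carry their own signatures.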

The reference data (thematic land cover maps developed from
photointerpretation of Medium Scale Photography (MSP)) appear
perfectly suited for a supervised approach. Areas of known types exist
as polygons from the reference data. Pixels found within an overlay of
the reference data would be used to develop spectral signatures. This
approach, however, has four drawbacks:

1. Precise registration of the polygons within an overlay plot onto
the TM digital image is required. Catts et al. (1987), in a
photointerpretation study in North Carolina, report that shifts as
minor as 4 to 9 meters can result in a shift from one cover type to
another.

2.  The  photointerpretation  process  groups  land  cover 
features  of  a  common  type  into  a  polygon.  The  spectral 
reflectance  of  a  single  cover  type,  however,  can  vary 
within  the  polygon  depending  on  the  site  condition  or 
the  type  of  cover  class  (particularly  urban  and  mixed 
forest  types).  Developing  spectral  signatures  based  on 
the  combination  of  all  pixels  within  a  polygon  provides 
only  an  "average"  spectral  signature  which,  therefore, 
does  not  actually  represent  the  components  of  the  cover 
type  accurately,  and  will  not  extend  well  to  unknown 
areas. 

3.  The  amount  of  effort  required  to  digitize  the  MSP 
polygons  is  extensive,  and  more  so  when  additional 
plots  are  required  to  provide  an  adequate  sample. 

4.  The  classification  system  used  for  the  reference 
data  is  best  developed  with  spectral  response  in  mind, 
which  requires  consideration  of  features  that  are  not 
directly  required  by  the  user's  classification  system. 

The  number  of  400-ha  plots  in  the  TM  image 
represented  only  a  small  portion  of  available  data.  The 
more  diverse  the  data  input  into  the  classifier,  the  greater 
the  confidence  that  the  spectral  signatures  developed 
represent  the  land  cover  features  found  throughout  the 
image.  Blocks  of  data  greater  than  the  400-ha  plot  itself 
input  into  the  classifier  would  increase  this  confidence, 
but  reliable  reference  sources  for  these  extended  blocks 
do  not  exist.  All  spectral  signatures  developed  during 
classification  of  the  extended  blocks  need  to  be  found 
within  the  400-ha  plot  itself  to  ensure  positive  labeling 
of  that  spectral  signature.  Conversely,  it  is  possible  that 
adequate  spectral  variation  could  be  found  within  the 
selected  400-ha  sample  plots.  Optimum  balance  between 
excessive  data  and  representative  spectral  signatures 
needs  to  be  determined. 

Objective 

The objective of this study is to determine the effect of pixel block
size on the development and labeling of spectral signatures produced
from an unsupervised classification of Landsat Thematic Mapper data
and the subsequent classification of independent test plots.
Differences in classification accuracy based upon spectral clusters
developed from 400-ha and 5,200-ha block sizes will be compared. The
ability to label clusters from within the 400-ha reference plot will
also be observed for both block sizes.


Methods 

The data selected for this study was TM scene number Y5050615222XO of
the Raleigh, N.C., area, acquired October 8, 1985. The image file
consisted of 6,967 pixels by 5,965 lines. The image contained
approximately 10% cloud cover. A subset of three TM bands was used for
the classification: TM3, TM4, and TM5.


There were 79 400-ha sample reference plots within the TM scene. Each
sample plot had stereo MSP coverage available (Aerochrome 2443 CIR
9 x 9 transparencies) that was obtained during late October through
early November 1985. A random sample of 15 sample plots was selected
to test the differences between block sizes. The image was displayed
on a red-green-blue (RGB) monitor, and National High Altitude
Photography (NHAP) and MSP photos were used to determine if the
photoplot had been accurately located. The x, y pixel coordinates of
plot center were determined with a roving cursor.

The block sizes chosen for testing were 80 x 80 pixels (560 nominal
hectares), hereafter BLOCK-1, and 256 x 240 pixels (5,200 ha),
BLOCK-2, which contained the true 400-ha circular plot. The larger
block was based on the maximum physical display capabilities of the
RGB display monitor. From the selected 15 sample plots, 6 plots were
further selected at random for use as training plots. Classification
statistics developed from the training plots were then extended to the
remaining nine plots for testing the effects of the block size on
classification accuracy.

Unsupervised classification was performed by a minimum distance to
mean clustering algorithm. All pixels from the three spectral bands
within the six training blocks were input to the classifier.
Classification statistics for each block size were developed
independently of each other. Classification results of each plot were
output to a plotter as a cluster map. Clusters developed during
classification were labeled to the appropriate cover type (table 1) by
comparing corresponding spatial location on the cluster map with the
400-ha plot interpreted from the MSP. All training plots were observed
in the labeling process to ensure consistency in cluster definition
among sample plots; i.e., a cluster was assigned to the cover type it
was most commonly associated with over all sample plots. Those
clusters which could not be consistently labeled into a single class
were labeled "unclassified." The remaining test plots were classified,
using a minimum distance classifier, after cluster statistics were
generated and labeled.
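
The sequence of clustering, labeling, and classification described
above can be sketched as follows. This is a minimal illustration that
assumes a simple one-pass variant of minimum-distance-to-mean
clustering; the note does not give the algorithm's details, so the
distance threshold `max_distance`, the majority-share rule `min_share`
(standing in for "consistently labeled"), and all function names are
assumptions made for illustration, not the software actually used.

```python
from collections import Counter
from math import dist


def cluster_pixels(pixels, max_distance):
    """One pass: join the nearest cluster mean, or start a new cluster
    when no existing mean lies within max_distance (assumed threshold)."""
    means, counts = [], []
    for p in pixels:
        nearest = min(range(len(means)),
                      key=lambda i: dist(p, means[i]), default=None)
        if nearest is None or dist(p, means[nearest]) > max_distance:
            means.append([float(v) for v in p])
            counts.append(1)
        else:
            counts[nearest] += 1
            n = counts[nearest]
            # Update the running mean of the cluster with the new pixel.
            means[nearest] = [m + (v - m) / n
                              for m, v in zip(means[nearest], p)]
    return means


def classify(pixel, means):
    """Minimum-distance classifier: index of the closest cluster mean."""
    return min(range(len(means)), key=lambda i: dist(pixel, means[i]))


def label_clusters(cluster_ids, reference_labels, min_share=0.5):
    """Give each cluster its most common reference cover type, or
    "unclassified" when that type's share does not exceed min_share."""
    by_cluster = {}
    for cid, ref in zip(cluster_ids, reference_labels):
        by_cluster.setdefault(cid, Counter())[ref] += 1
    labels = {}
    for cid, counter in by_cluster.items():
        cover, n = counter.most_common(1)[0]
        share = n / sum(counter.values())
        labels[cid] = cover if share > min_share else "unclassified"
    return labels


# Hypothetical three-band pixel values (TM3, TM4, TM5 order assumed).
pixels = [(10, 10, 10), (12, 10, 10), (50, 60, 70), (52, 60, 70)]
means = cluster_pixels(pixels, max_distance=5)
ids = [classify(p, means) for p in pixels]
labels = label_clusters(ids, ["pine", "pine", "grassland", "hardwood"])
print(means, ids, labels)
```

In this toy example the second cluster is split evenly between two
reference types, so it receives the "unclassified" label.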

Table 1.—Revised classification scheme used for labeling clusters
developed from an unsupervised classification of TM data from the
Piedmont of North Carolina.

Level I        Level II (Forest)

Cropland       Pine
Grassland      Oak-pine
Urban          Hardwoods
Water          Nonstocked
Forest

Proportional estimates of land cover for each class within the nine
sample test plots were calculated for the 400-ha plot based upon
classification of each block size and MSP (reference). A truer test on
the effect of block size would be based on matched observations from
individual plots. A random sample of registered points between
reference and classified images from BLOCK-1 and BLOCK-2 would provide
a rigorous test of accuracy. However, the registration of such points
on a landscape as diverse as the Piedmont of North Carolina produced
an unacceptable amount of misregistration between the sources of
interest. The best analysis available was based on comparisons of
percent area from each plot.
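
A percent-area estimate of this kind can be sketched as a count of
labeled pixels falling inside the circular plot. The sketch below
assumes the nominal 30-m TM pixel size to convert the 400-ha plot to a
radius in pixels; the function name, the grid layout, and the tiny
example grid are hypothetical and only illustrate the calculation.

```python
from collections import Counter
from math import pi, sqrt


def plot_percent_area(label_grid, center_row, center_col, radius_px):
    """Percent of pixels per cover label inside a circular plot."""
    counts = Counter()
    for r, row in enumerate(label_grid):
        for c, label in enumerate(row):
            inside = ((r - center_row) ** 2 + (c - center_col) ** 2
                      <= radius_px ** 2)
            if inside:
                counts[label] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


# Radius of a 400-ha circle in pixels, assuming 30-m pixels.
PIXEL_SIZE_M = 30.0  # assumption: nominal TM resolution
radius_px = sqrt(400 * 10_000 / pi) / PIXEL_SIZE_M

# Tiny hypothetical label grid: one water pixel surrounded by pine.
grid = [
    ["pine", "pine", "pine"],
    ["pine", "water", "pine"],
    ["pine", "pine", "pine"],
]
print(plot_percent_area(grid, center_row=1, center_col=1, radius_px=1))
```

Under the 30-m assumption the plot radius is between 37 and 38 pixels,
so the circle fits inside an 80 x 80 pixel block.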

Because of the dependence of errors obtained from proportional
estimates of plot data (i.e., one misclassification must be accounted
for by another class), a nonparametric analysis, the Multi-Response
Randomized Block Permutation (MRBP) procedure (Mielke et al. 1981),
was used to assign significance levels to block size comparisons. This
test measures the level of agreement between classifications and how
the agreement compares to random chance. The MRBP compared differences
between classifications in Euclidean distance. The null hypothesis was
stated as no difference between unsupervised and reference
classifications, and the probability value was given by
$P(\delta_e < \delta_0)$, where $\delta_e$ = permuted differences and
$\delta_0$ = observed difference. The analysis contains a measure of
agreement ranging from 0.0 to 1.0, where 1.0 indicates perfect
agreement. A small P ($P < \alpha$, where $\alpha = 0.05$) indicates
that there was no difference between treatments. Differences among the
treatments are signified by increasing values of P.
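
The idea of comparing an observed difference with permuted differences
can be sketched for two classifications of the same plots. This is a
simplified exact-permutation illustration in the spirit of MRBP, not
the published procedure: it assumes the agreement measure takes the
form 1 minus the ratio of the observed mean Euclidean distance to its
mean over all permutations, it enumerates every re-pairing of plots
(practical only for a handful of plots), and it counts ties with the
observed value as extreme. The function name and example vectors are
hypothetical.

```python
from itertools import permutations
from math import dist


def agreement_permutation_test(reference, classified):
    """Return (agreement, p_value) for two lists of per-plot
    proportion vectors given in the same plot order."""
    n = len(reference)
    # Distance between every reference plot and every classified plot.
    d = [[dist(reference[i], classified[j]) for j in range(n)]
         for i in range(n)]
    # Observed statistic: mean distance between matched plots.
    observed = sum(d[i][i] for i in range(n)) / n
    # Mean of the statistic over all re-pairings of plots.
    expected = sum(map(sum, d)) / (n * n)
    hits = total = 0
    for perm in permutations(range(n)):
        delta = sum(d[i][perm[i]] for i in range(n)) / n
        total += 1
        if delta <= observed + 1e-12:  # as small as, or smaller than, observed
            hits += 1
    agreement = 1.0 - observed / expected
    return agreement, hits / total


# Hypothetical proportion vectors for three plots (two classes each).
ref = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
agreement, p_value = agreement_permutation_test(ref, ref)
print(agreement, p_value)
```

Identical classifications give an agreement of 1.0, and the
probability is then the share of re-pairings that match the plots as
well as the true pairing does.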


Results  and  Discussion 

Table 2 presents average class results from the six training plots and
the nine test plots. Although conclusions can only be drawn from the
test plots, results from the training plots are presented to
demonstrate the variability in classification accuracy between
training and test plots. These values provide an initial assessment of
the classification accuracy based on spectral signatures developed
from each block size. Of major note was the unclassified category. The
classification based on BLOCK-1 produced clusters which could not be
labeled consistently among reference plots into one of the predefined
classes. The BLOCK-2 classification shows that no unclassified areas
occurred on the plots. The unclassified clusters tended to represent
grassland and forest cover types equally and could not be assigned to
any particular class.

Figure 1.—Percent area from reference (Y axis) and test data (X axis)
for designated classes.

Figure 1 plots each variable from the nine test plots: MSP vs. BLOCK-1
and BLOCK-2 results. A 45° "no difference" line was included to aid
interpretation of the graphs. Both block sizes consistently
underestimate cropland, urban, and oak-pine, and overestimate
grassland and hardwood. In general, the urban, pine, and hardwood
classes tended to follow the no-difference line more closely from the
BLOCK-2 classification than from the BLOCK-1 classification.
Classification of nonstocked was better from BLOCK-1. The four forest
subclasses taken as a whole (forest) indicated that BLOCK-2 produced
fewer differences than BLOCK-1.
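
The test-plot averages in table 2 allow a rough numerical check of
these tendencies. The sketch below transcribes the test columns of
table 2, assumes they are percent of plot area, and reports the signed
difference of each block estimate from the MSP reference along with
the block that comes closer; the function name is hypothetical.
Because these are averages rather than the per-plot values plotted in
figure 1, they show tendencies only.

```python
# Test-plot averages transcribed from table 2 (assumed percent area):
# class: (reference MSP, BLOCK-1, BLOCK-2)
TEST_AVERAGES = {
    "Cropland": (18.04, 13.10, 15.38),
    "Grassland": (10.12, 13.50, 14.93),
    "Urban": (10.22, 1.43, 4.40),
    "Water": (1.00, 0.84, 0.94),
    "Pine": (27.22, 18.84, 27.26),
    "Oak-pine": (20.17, 11.60, 8.01),
    "Hardwood": (14.34, 32.57, 27.06),
    "Nonstocked": (0.21, 0.80, 1.87),
}


def compare_blocks(averages):
    """Signed difference from reference for each block, and the block
    whose average is closer to the reference."""
    rows = {}
    for cover, (ref, b1, b2) in averages.items():
        d1, d2 = round(b1 - ref, 2), round(b2 - ref, 2)
        closer = "BLOCK-1" if abs(d1) < abs(d2) else "BLOCK-2"
        rows[cover] = (d1, d2, closer)
    return rows


rows = compare_blocks(TEST_AVERAGES)
for cover, (d1, d2, closer) in rows.items():
    print(f"{cover:<12}{d1:>8.2f}{d2:>8.2f}  {closer}")
```

On these averages, BLOCK-2 is closer to the reference for urban, pine,
and hardwood, and BLOCK-1 is closer for nonstocked, in line with the
text.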

Results indicate that classification into oak-pine and urban classes
was difficult. This was partly because of the spatial resolution of
the TM sensor and the land use definition of these classes. The
spatial resolution of the TM increases the number of pure pixels,
those pixels representing a single ground feature. Such land use
classes as oak-pine and urban are actually composed of multiple ground
features. The TM detects the separate components which are classified
into the separate ground features. The results bear witness to this
condition; both oak-pine and urban were significantly underestimated.

Table 2.—Average plot estimates from reference (MSP) and training
blocks for the six training and nine test plots.

                   REF. (MSP)            BLOCK-1             BLOCK-2
CLASS TYPE      Training      Test  Training      Test  Training      Test

Cropland           14.10     18.04     13.27     13.10     13.10     15.38
Grassland          10.62     10.12     14.56     13.50     13.49     14.93
Urban              18.44     10.22      3.85      1.43      6.74      4.40
Water               1.16      1.00      0.77      0.84      1.33      0.94
Pine               26.11     27.22     21.41     18.84     31.51     27.26
Oak-pine           13.14     20.17     19.72     11.60     13.44      8.01
Hardwood           16.41     14.34     15.43     32.57     13.83     27.06
Nonstocked          2.63      0.21      1.62      0.80      1.38      1.87
Unclassified                            9.33      6.03      0.00      0.00
FOREST TOTAL       55.66     61.73     67.51     69.04     60.20     64.23

Observations made during the labeling process indicated that only
those clusters representing seedling/sapling oak-pine stands could be
reliably identified. The principal reason was that the integrated
spectral reflectance from the seedling/saplings, the understory, and
soil produced a spectral signature different enough from other forest
signatures to allow an oak-pine classification. Similar conditions
existed in the urban class. Only impervious surfaces, such as wide
roads, parking lots, or large buildings, were classified as urban. The
photointerpreted reference data, however, identified urban and
residential areas with significant tree cover, or vegetated
rights-of-way, as urban.

A test of significance over all classes from each plot was conducted
using MRBP. The agreement values for the block comparisons (table 3)
indicated low agreement for all comparisons. Probability values were
less than the selected $\alpha$ (0.05), leading to the conclusion that
there were no significant differences between MSP and the two block
sizes. The comparisons between block size classification and MSP
reference data were very similar, indicating no difference in
estimating reference data with either block size classification. As
would be expected, the agreement between the BLOCK-1 and BLOCK-2
classifications was the strongest, indicating that the
misclassifications were caused by spectral inseparability of cover
types rather than the sample size used to develop spectral signatures.
Although MRBP indicates significant agreement between MSP and block
sizes, the measure-of-agreement statistic indicates that practical
significance is questionable.

The labeling of clusters when based on data not included in the
reference set was a major concern for using the larger block size.
Four spectral clusters out of a total of 65 from BLOCK-1 did not occur
in areas having reference data. Unlabeled clusters should not be
confused with the "unclassified" label. These clusters went unlabeled
on BLOCK-1 plots because of their scarcity. Precise labeling was not
possible and, therefore, the clusters could not be properly assigned
to any given class type. These clusters represent 0.81% of all pixels
input for final classification. Seventeen clusters were not labeled
out of 85 from BLOCK-2. Of these, 12 clusters represented cloud
coverage. Because any plot that exhibited cloud influence (cloud
covered or shadowed) was dropped from analysis, these clusters were
not included. The remaining five clusters, representing 0.75% of all
classified pixels, could not be labeled using the reference plot.
Although not conclusive, this demonstrates that clusters developed
from blocks representing areas larger than the reference plot can be
located and labeled, as can clusters developed from blocks
representing the reference plots alone.

Table 3.—MRBP agreement and probability results for MSP (reference)
and BLOCK-1 and BLOCK-2. Agreement values range from 0 to 1, where 1
is perfect agreement. P-values less than $\alpha = 0.05$ indicate a
statistically significant agreement between classifications.

                            MSP     Block-1     Block-2

Block-1    Agreement     0.1486
           P-value       0.0039
Block-2    Agreement     0.1353      0.3247
           P-value       0.0281      0.0014

Conclusions 

There was no significant difference in the development, labeling, and
classification of TM data from the Piedmont of North Carolina based
upon the size of sample blocks input to the classifier. However, small
improvements in classification were seen for the urban, pine, and
hardwood categories using the larger block size. This could be
associated with the increased area sampled. The greater the area
sampled, the greater the chance to obtain spectral signatures of
diverse land cover conditions. The addition of more reference plots to
increase the sample size, however, would be prohibitively expensive
when equivalent information could be obtained from larger blocks
surrounding fewer reference plots. There was no difference in the
registering and labeling of clusters developed between the two block
sizes. The use of extended blocks surrounding reference data ensures
better use of the available satellite data than using reference data
alone.

References 

Catts, G. P.; Cost, N. D.; Czaplewski, R. L.; Snook, P. W. 1987.
Preliminary results from a method to update timber resource
statistics in North Carolina. In: Proceedings, 11th biennial workshop
on color aerial photography in the plant sciences; 1987 April 27-May
1; Weslaco, TX.

Mielke, P. W.; Berry, K. J.; Williams, J. S. 1981. A class of
nonparametric tests based on multi-response permutation procedures.
Biometrika. 68: 720-724.




